About the project

The project seems interesting, and I enrolled in the course to learn more about R and open data science. This site will be updated as the tasks are completed, and a new ABOUT introduction will appear with more details about the course contents and more general questions about open data and open science.

Analysis of Learning Data

The learning data come from the International Survey of Approaches to Learning, collected during 2014. The dataset used for these analyses is a modified version containing only the gender and age of the students, together with new variables for attitude toward statistics, exam points, and scores for the deep, strategic and surface learning approaches.

Raw analyses

First we plotted a correlation matrix with ggpairs to find correlations between the variables and to explore their distributions visually. Age has a right-skewed distribution (a long right tail): the participants are mainly young, as expected from the cohort, and the mean age is higher than the median. The other variables are roughly normally distributed, apart from attitude in men, which is slightly right-tailed. We found a positive correlation between attitude and exam points, and a negative correlation between the deep learning approach and the surface learning approach. See plot below.

We obtained descriptive statistics grouped by gender and observed that mean age and mean attitude are higher in males than in females. The other variables have similar means in the two groups. See table below. In simple t-tests, only the difference in mean attitude between genders was significant (p<0.001).

variable  gender  mean   sd  median  IQR
gender*   F        1.0  0.0     1.0  0.0
gender*   M        2.0  0.0     2.0  0.0
age       F       24.9  7.4    22.0  6.0
age       M       26.8  8.4    24.0  8.0
attitude  F        3.0  0.7     3.0  1.1
attitude  M        3.4  0.6     3.4  0.8
deep      F        3.7  0.5     3.7  0.8
deep      M        3.7  0.6     3.8  0.7
stra      F        3.2  0.7     3.2  1.1
stra      M        3.0  0.8     3.0  1.2
surf      F        2.8  0.5     2.8  0.6
surf      M        2.7  0.6     2.6  0.9
points    F       22.3  5.8    23.0  7.0
points    M       23.5  6.0    23.5  8.2

Multiple linear model

We fitted a linear model with exam points as the outcome and attitude, strategic learning (stra) and age as explanatory variables. The summary of the model is presented below. Participants with higher attitude scores had higher exam points (p<0.0001). There was a trend towards higher exam points with higher strategic learning scores (p<0.1). Controlling for gender did not modify the results of the model substantially. The model has an adjusted R-squared of 0.2037, which means it explains about 20% of the variance in exam points; this is quite reasonable for survey data. The model can be written as: Exam points = 10.9 + 3.5 × attitude + 1.0 × strategy - 0.1 × age

Dependent variable: points

                     estimate    (std. error)
attitude             3.481***    (0.562)
stra                 1.004*      (0.534)
age                  -0.088*     (0.053)
Constant             10.895***   (2.648)

Observations         166
R2                   0.218
Adjusted R2          0.204
Residual Std. Error  5.260 (df = 162)
F Statistic          15.069*** (df = 3; 162)
Note: *p<0.1; **p<0.05; ***p<0.01
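
As a worked illustration of the fitted equation and of the adjusted R-squared, the following minimal Python sketch plugs values into the coefficients reported in the table. The original analysis was done in R; the function names here are illustrative, and the example input (the female group means of attitude, stra and age from the descriptive table) is chosen only to show the arithmetic.

```python
# Coefficients copied from the regression table (rounded to three decimals).
COEF = {"const": 10.895, "attitude": 3.481, "stra": 1.004, "age": -0.088}


def predict_points(attitude, stra, age):
    """Fitted exam points for one student under the linear model."""
    return (COEF["const"]
            + COEF["attitude"] * attitude
            + COEF["stra"] * stra
            + COEF["age"] * age)


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size and number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative input: female group means of attitude, stra and age.
print(round(predict_points(attitude=3.0, stra=3.2, age=24.9), 1))

# R2 = 0.218 with 166 observations and 3 predictors.
print(round(adjusted_r2(0.218, 166, 3), 2))
```

The prediction comes out at about 22.4 points, close to the observed female mean of 22.3, and the adjusted R-squared rounds to 0.20, in line with the reported 0.204 (the small difference comes from using the rounded R2).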

In the diagnostic plots of the model, the residuals vs. fitted plot shows a random pattern, meaning no bias; the Q-Q plot is almost linear, which indicates that the residuals are approximately normal; and the residuals vs. leverage plot shows no outliers which could unduly affect the model.


Chapter 4

Use of Boston data from MASS library

Boston is a dataset with 506 observations and 14 variables on housing values in the suburbs of Boston. A correlation plot matrix and summaries of the variables can be found below.

We observe moderate to strong correlations (roughly 0.6 or more in absolute value) between the following pairs of variables: rad and crim, tax and crim, age and zn, dis and zn, nox and indus, age and indus, dis and indus, rad and indus, tax and indus, lstat and indus, age and nox, dis and nox, rad and nox, tax and nox, lstat and nox, lstat and rm, medv and rm, dis and age, lstat and age, tax and rad, lstat and medv. Apart from rm, which appears to follow a normal distribution, none of the variables is normally distributed.
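
For reference, the Pearson correlation coefficient shown in such a matrix can be computed as in the minimal Python sketch below. The two toy series are made up for illustration and are not taken from the Boston data; they only show that the coefficient is signed, so a strong relationship can appear as a value close to -1 as well as close to 1.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up toy data: one increasing and one decreasing relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2.1, 3.9, 6.2, 8.0, 9.8]
decreasing = [9.8, 8.0, 6.2, 3.9, 2.1]

print(round(pearson(x, increasing), 3))
print(round(pearson(x, decreasing), 3))
```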

Descriptive statistics (summary) of the variables in Boston are given in the following table.

variable   mean     sd  median   se    IQR
crim        3.6    8.6     0.3  0.4    3.6
zn         11.4   23.3     0.0  1.0   12.5
indus      11.1    6.9     9.7  0.3   12.9
chas        0.1    0.3     0.0  0.0    0.0
nox         0.6    0.1     0.5  0.0    0.2
rm          6.3    0.7     6.2  0.0    0.7
age        68.6   28.1    77.5  1.3   49.0
dis         3.8    2.1     3.2  0.1    3.1
rad         9.5    8.7     5.0  0.4   20.0
tax       408.2  168.5   330.0  7.5  387.0
ptratio    18.5    2.2    19.1  0.1    2.8
black     356.7   91.3   391.4  4.1   20.8
lstat      12.7    7.1    11.4  0.3   10.0
medv       22.5    9.2    21.2  0.4    8.0

Because the variables are measured on very different scales (compare tax and nox in the table above), we standardised the data.

Descriptive statistics (summary) of the standardised variables in Boston are given in the following table. Note that after standardisation every variable has a mean of 0 and a standard deviation of 1.

variable  mean  sd  median  se  IQR
crim         0   1    -0.4   0  0.4
zn           0   1    -0.5   0  0.5
indus        0   1    -0.2   0  1.9
chas         0   1    -0.3   0  0.0
nox          0   1    -0.1   0  1.5
rm           0   1    -0.1   0  1.1
age          0   1     0.3   0  1.7
dis          0   1    -0.3   0  1.5
rad          0   1    -0.5   0  2.3
tax          0   1    -0.5   0  2.3
ptratio      0   1     0.3   0  1.3
black        0   1     0.4   0  0.2
lstat        0   1    -0.2   0  1.4
medv         0   1    -0.1   0  0.9
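
The standardisation itself is simple arithmetic: subtract the mean of a variable and divide by its standard deviation. The Python sketch below illustrates this on made-up numbers (not the Boston data). It assumes the sample standard deviation with n - 1 in the denominator; the original analysis was done in R, so this is an illustration rather than the code that was actually run.

```python
from statistics import mean, stdev


def standardise(values):
    """Centre on the mean and divide by the sample standard deviation (n - 1)."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


# Made-up values on a tax-like scale, for illustration only.
toy = [187.0, 279.0, 330.0, 666.0, 711.0]
z = standardise(toy)

print([round(v, 2) for v in z])
print(round(mean(z), 6), round(stdev(z), 6))
```

After the transformation the toy variable has mean 0 and standard deviation 1, while its median and IQR still depend on the shape of the distribution, as in the table.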

We divided the standardised Boston data set into a training and a test set, with 80% of the data assigned to the training set, and fitted a linear discriminant analysis (LDA) with the crime rate, categorised into four classes, as the target variable.
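
A random 80/20 split of the 506 rows can be sketched as follows in Python. The seed, the function name and the choice to truncate the training size to a whole number are assumptions made for illustration; the rows actually drawn in the original R analysis are not known.

```python
import random


def train_test_split(n_rows, train_fraction=0.8, seed=1):
    """Return (train, test) lists of row indices.

    The training size is n_rows * train_fraction truncated to an integer.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    n_train = int(n_rows * train_fraction)
    train = sorted(rng.sample(range(n_rows), n_train))
    in_train = set(train)
    test = [i for i in range(n_rows) if i not in in_train]
    return train, test


train, test = train_test_split(506)
print(len(train), len(test))
```

With 506 observations this gives 404 training rows and 102 test rows; 102 is also the total count in the cross-tabulation of crime classes.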

We plotted this LDA model in the following biplot.

                 [-0.419,-0.411]  (-0.411,-0.39]  (-0.39,0.00739]  (0.00739,9.92]
[-0.419,-0.411]               12              13                1               0
(-0.411,-0.39]                 3              12                8               0
(-0.39,0.00739]                1               3               22               1
(0.00739,9.92]                 0               0                0              26

As the biplot shows, the predictor rad (index of accessibility to radial highways) points towards the high crime per capita class (blue). On the other hand, a high proportion of residential land zoned for lots over 25,000 sq.ft. (variable zn) predicts a low crime per capita. The middle crime per capita classes, depicted in red and green, overlap and are predicted by multiple predictors.

We predicted classes with the LDA model and observed that it predicts the crime rates above the mean (i.e. the highest crime class) very accurately, but fails to distinguish the lower crime classes effectively.
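
These observations can be quantified from the cross-tabulation of crime classes. The Python sketch below copies the counts from that table and computes the overall accuracy and the share of each row that lies on the diagonal. It assumes that rows are the correct classes and columns the predicted classes, which the output itself does not state; the overall accuracy does not depend on that assumption.

```python
# Counts copied from the cross-tabulation of crime classes.
# Assumption: rows are the correct classes, columns the predicted classes.
LABELS = ["[-0.419,-0.411]", "(-0.411,-0.39]", "(-0.39,0.00739]", "(0.00739,9.92]"]
TABLE = [
    [12, 13, 1, 0],
    [3, 12, 8, 0],
    [1, 3, 22, 1],
    [0, 0, 0, 26],
]


def accuracy(table):
    """Share of all observations on the diagonal (correctly classified)."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def per_class_hit_rate(table):
    """Share of each row that falls on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(table)]


print(round(accuracy(TABLE), 3))
for label, rate in zip(LABELS, per_class_hit_rate(TABLE)):
    print(label, round(rate, 2))
```

Overall, 72 of the 102 test observations (about 71%) are classified correctly. The highest class is classified without error, whereas only about half of the observations in the two lowest classes land on the diagonal.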

We calculated the distances between the observations in the standardised Boston dataset and performed a k-means clustering analysis. We obtained two clusters and see that, for example, rad separates the two clusters well.
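
To illustrate what k-means does, here is a minimal Python sketch of the basic Lloyd iteration with Euclidean distances on made-up two-dimensional points. It is not the code used in the analysis: the toy points and starting centres are invented, and the clustering routine used in R may implement a different variant of the algorithm.

```python
import math


def kmeans(points, centres, max_iter=100):
    """Basic Lloyd iteration: assign points to the nearest centre, then update centres."""
    centres = [list(c) for c in centres]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre (Euclidean distance).
        new_labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the centres are final
            break
        labels = new_labels
        # Update step: each centre becomes the mean of its assigned points.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Made-up standardised points forming two well separated groups.
points = [(-0.5, -0.6), (-0.4, -0.5), (-0.6, -0.4),
          (1.6, 1.5), (1.7, 1.6), (1.5, 1.7)]
labels, centres = kmeans(points, centres=[points[0], points[-1]])
print(labels)
print([[round(c, 2) for c in centre] for centre in centres])
```

On real data the result depends on the starting centres and on the chosen number of clusters, so several starts and values of k are usually compared.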